Abstract

Deep learning, in particular the Convolutional Neural Network (CNN), has recently achieved promising results in face recognition. However, it remains an open question why CNNs work well and how to design a 'good' architecture. Existing works tend to focus on reporting CNN architectures that work well for face recognition rather than investigating the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use the public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs, which are trained on private databases. We propose three CNN architectures, which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluates the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, a traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show that two crucial factors for good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.

1. Introduction 

The conventional face recognition pipeline consists of four stages: face detection, face alignment, feature extraction (or face representation) and classification. Perhaps the single most important stage is feature extraction. In constrained environments, hand-crafted features such as Local Binary Patterns (LBP) [1] and Local Phase Quantisation (LPQ) [2] have achieved respectable face recognition performance. However, the performance using these features degrades dramatically in unconstrained environments, where face images cover complex and large intra-personal variations such as pose, illumination, expression and occlusion. It remains an open problem to find an ideal facial feature which is robust for face recognition in unconstrained environments (FRUE). In the last three years, convolutional neural networks (CNNs), rebranded as 'deep learning', have achieved very impressive results on FRUE. Unlike the traditional hand-crafted features, the CNN learning-based features are more robust to complex intra-personal variations. More notably, the top three face recognition rates reported on the FRUE benchmark database LFW (Labeled Faces in the Wild) [12] have been achieved by CNN methods (e.g. [22]). The success of the latest CNNs on FRUE and on the more general object recognition task stems from the following facts: (1) much larger labeled training sets are available; (2) GPU implementations greatly reduce the time of training a large CNN; (3) CNNs greatly improve the model generalisation capacity by introducing effective regularisation strategies, such as dropout [10].

* These authors contributed equally to this work

Despite the promising performance achieved by CNNs, it remains unclear how to design a 'good' CNN architecture for a specific classification task, due to the lack of theoretical guidance. However, some insights into CNN design can be gained by experimental comparisons of different CNN architectures. The work [5] made such comparisons and a comprehensive analysis for the task of object recognition. However, face recognition is very different from object recognition. Specifically, faces are aligned via 2D similarity transformation or 3D pose correction to a fixed reference position in images before feature extraction, while object recognition usually does not conduct such alignment, and therefore objects appear in arbitrary positions. As a result, the CNN architectures used for face recognition [19, 21, 22, 25, 28] are rather different from those for object recognition. For the task of face recognition, it is therefore important to make a systematic evaluation of the effect of different CNN design and implementation choices. In addition, the published CNNs (e.g. [21]) are trained on different face databases, most of which are not publicly available. The difference in training sets might result in unfair comparisons of CNN architectures. To avoid this unfairness, the comparison of different CNNs should be conducted on a common ground.

To clarify the contributions of different components of CNN-based face recognition systems, a systematic evaluation is conducted in this paper. To make our work reproducible, all the networks evaluated are trained on the publicly available LFW database. Specifically, our contributions are as follows:

• Different CNN architectures, including the number of filters and layers, are compared. In addition, we evaluate the impact of the multiple network fusion introduced by [21].

• Various implementation choices, such as data augmentation, pixel value type (colour or grey) and similarity measure, are evaluated.

• We quantitatively analyse how downstream metric learning methods such as Joint Bayesian [6] can boost the effectiveness of the CNN-learned features.

• Finally, source code for our CNN architectures and trained networks will be made publicly available (the training data is already public). This provides an extremely competitive baseline for face recognition to the community. To our knowledge, we are the first to publish fully reproducible CNNs for face recognition.

2. Related Work 

CNN methods have drawn considerable attention in the field of face recognition in recent years. In particular, CNNs have achieved impressive results on FRUE. In this section, we briefly review these CNNs.

The researchers in the Facebook AI group trained an 8-layer CNN named DeepFace [25]. The first three layers are conventional convolution-pooling-convolution layers. The subsequent three layers are locally connected, followed by 2 fully connected layers. Pooling layers make learned features robust to local transformations but lose local texture details. Pooling layers are important for object recognition since the objects in images are not well aligned. However, face images are well aligned before training a CNN. It is claimed in [25] that one pooling layer is a good balance between robustness to local transformations and preservation of texture details. DeepFace is trained on the largest face database to date, which contains four million facial images of 4,000 subjects. Another contribution of [25] is 3D face alignment. Traditionally, face images are aligned using a 2D similarity transformation before they are fed into CNNs. However, this 2D alignment cannot handle out-of-plane rotations. To overcome this limitation, [25] proposes a 3D alignment method using an affine camera model.

In [21], a CNN-based face representation, referred to as Deep hidden IDentity feature (DeepID), is proposed. Unlike DeepFace, whose features are learned by one single big CNN, DeepID is learned by training a collection of small CNNs (network fusion). The input of one single CNN is a crop/patch of the facial image, and the features learned by all CNNs are concatenated to form a powerful feature. Both RGB and grey crops extracted around facial points are used to train DeepID. The length of DeepID is 2 (RGB and grey images) x 60 (crops) x 160 (feature length of one network) = 19,200. One small network consists of 4 convolutional layers, 3 max pooling layers and 2 fully connected layers, as shown in Table 1. DeepID uses identification information only to supervise the CNN training. In comparison, DeepID2 [19], an extension of DeepID, uses both identification and verification information to train a CNN, aiming to maximise the inter-class difference while minimising the intra-class variations. To further improve the performance of DeepID and DeepID2, DeepID2+ [22] is proposed. DeepID2+ adds the supervision information to all the convolutional layers rather than only the topmost layers as in DeepID and DeepID2. In addition, DeepID2+ increases the number of filters of each layer and uses a much bigger training set than DeepID and DeepID2. In [22], it is also discovered that DeepID2+ has three interesting properties: being sparse, selective and robust.

The work [28] proposes another face recognition pipeline, referred to as WebFace, which also learns the face representation using a CNN. WebFace collects a database which contains around 10,000 subjects and 500,000 images and makes this database publicly available. Motivated by the very deep architectures of [18, 23], WebFace trains a much deeper CNN than those [21, 19, 22, 25] used for face recognition, as shown in Table 1. Specifically, WebFace trains a 17-layer CNN which includes 10 convolutional layers, 5 pooling layers and 2 fully connected layers, detailed in Table 1. Note that the use of very small convolutional filters (3 x 3), which limits the loss of texture information along a very deep architecture, is crucial to learning a powerful feature. In addition, WebFace stacks two 3 x 3 convolutional layers (without pooling in between), which is as effective as a 5 x 5 convolutional layer but with fewer parameters.
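The parameter saving from stacking small filters can be checked with a short calculation. The sketch below is illustrative only: it assumes equal input and output channel counts, ignores biases, and uses a hypothetical channel count of 64; none of these choices is taken from [28].

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions with no pooling in between."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


channels = 64  # hypothetical channel count, chosen only for illustration
stacked = 2 * conv_weights(3, channels, channels)  # two stacked 3x3 layers
single = conv_weights(5, channels, channels)       # one 5x5 layer

# Both options see the same 5x5 input window.
print(receptive_field([3, 3]), receptive_field([5]))
# The stacked option needs 18*C^2 weights, the single layer 25*C^2.
print(stacked, single)
```

Under these assumptions the stacked pair covers the same 5 x 5 window with 18/25 of the weights, and it also applies one extra non-linearity.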

Table 1 compares three typical CNNs (DeepFace [25], DeepID [21] and WebFace [28]). It is clear that their architectures and implementation choices are rather different, which motivates our work. In this study, we make systematic evaluations to clarify the contributions of different components on a common ground.


Table 1. Comparisons of 3 Published CNNs

| | Input image (1) | Architecture (2) | No. of para. | Patch fusion | Feature length | Training set |
|---|---|---|---|---|---|---|
| DeepFace [25] | 152x152x3 | C1:32x11x11 (3), M2, C3:16x9x9, L4:16x9x9, L5:16x7x7, L6:16x5x5, F7, F8 | 120M+ | No | 4096 | 120M+ images, 4K+ subjects |
| DeepID [21] | 39x31x{3,1}, 31x31x{3,1} | C1:20x4x4, M2, C3:40x3x3, M4, C5:60x3x3, M6, C7:80x2x2, F8, F9 | 101M+ | Yes | 19200 | 202K+ images, 10K+ subjects |
| WebFace [28] | 100x100x1 | C1:32x3x3, C2:64x3x3, M3, C4:64x3x3, C5:128x3x3, M6, C7:96x3x3, C8:192x3x3, M9, C10:128x3x3, C11:256x3x3, M12, C13:160x3x3, C14:320x3x3, A15, F16, F17 | 5M+ | No | 320 | 986K+ images, 10K subjects |

(1) The input image is represented as width x height x channels. 1 and 3 mean grey and RGB images, respectively.

(2) The capital letters C, M, L, A, F represent convolutional, max pooling, locally connected, average pooling and fully connected layers, respectively. These capital letters are followed by the indices of CNN layers.

(3) The number of filters and filter size are denoted as 'num x size x size'.


3. Methodology 


LFW is the de facto benchmark database for FRUE. Most existing CNNs [25, 21, 19, 22] train their networks on private databases and test the trained models on LFW. In comparison, we train our CNNs using only LFW data to make our work easily reproducible. As a consequence, we cannot directly use the reported CNN architectures [19, 21, 22, 25, 28], since our training data is much less extensive. We introduce three architectures adapted to our training set in subsection 3.1. To further improve the discrimination of CNN-learned features, a metric learning method is usually used. One metric learning method, the Joint Bayesian model [6], is detailed in subsection 3.2.


3.1. CNN Architectures 

How to design a 'good' CNN architecture remains an open problem. Generally, the architecture depends on the size of the training data. Less data should drive a smaller network (fewer layers and filters) to avoid overfitting. In this study, the size of our training data is much smaller than that used by the state-of-the-art methods [25, 21, 19, 22, 28]; therefore, smaller architectures are designed.

We propose three CNN architectures adapted to the size of the training data in LFW. These architectures are of three different sizes: small (CNN-S), medium (CNN-M), and large (CNN-L). CNN-S and CNN-M have 3 convolutional layers and 2 fully connected layers, while CNN-M has more filters than CNN-S. Compared with CNN-S and CNN-M, CNN-L has 4 convolutional layers. The activation function we use is the Rectified Linear Unit (ReLU) [14]. In our experiments, dropout [10] does not improve the performance of our CNNs; therefore, it is not applied to our networks. As in previous work, a softmax function is used in the last layer for predicting a single class out of K mutually exclusive classes (K being the number of subjects in the context of face recognition). During training, the learning rate is set to 0.001 for all three networks, and the batch size is fixed to 100. Table 2 details these three architectures.


Table 2. Our CNN Architectures

| | CNN-S | CNN-M | CNN-L |
|---|---|---|---|
| conv1 | 12x5x5; st. 1, pad 0; x2 pool | 16x5x5; st. 1, pad 0; x2 pool | 16x3x3; st. 1, pad 1; - |
| conv2 | 24x4x4; st. 1, pad 0; x2 pool | 32x4x4; st. 1, pad 0; x2 pool | 16x3x3; st. 1, pad 1; x2 pool |
| conv3 | 32x3x3; st. 2, pad 0; x2 pool | 48x3x3; st. 2, pad 0; x2 pool | 32x3x3; st. 1, pad 1; x3 pool, st. 2 |
| conv4 | - | - | 48x3x3; st. 1, pad 1; x2 pool |
| fully connected | 160 | 160 | 160 |
| fully connected | 4000, softmax | 4000, softmax | 4000, softmax |

Each convolutional layer is described by 3 entries: the 1st indicates the number of filters and filter size as 'num x size x size'; the 2nd specifies the convolutional stride ('st.') and padding ('pad'); and the 3rd specifies the max-pooling downsampling factor. For fully connected layers, we specify their dimensionality: 160 for the feature length and 4000 for the number of classes/subjects. Note that each set of 9 splits (training set) of LFW has a different number of subjects, but always around 4000.
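The layer settings in Table 2 determine how many activations reach the first fully connected layer. The following sketch traces the spatial size through each network. It assumes a 58 x 58 input (the crop size used in Section 4), floor rounding, and that 'x2 pool' means a 2 x 2 window with stride 2; the helper names are hypothetical and the rounding behaviour of the actual implementation may differ.

```python
def out_size(size, window, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window (floor rounding)."""
    return (size + 2 * pad - window) // stride + 1


# Per layer: (filters, kernel, stride, pad, pool_window, pool_stride).
# pool_window is None where Table 2 lists no pooling.
ARCHITECTURES = {
    "CNN-S": [(12, 5, 1, 0, 2, 2), (24, 4, 1, 0, 2, 2), (32, 3, 2, 0, 2, 2)],
    "CNN-M": [(16, 5, 1, 0, 2, 2), (32, 4, 1, 0, 2, 2), (48, 3, 2, 0, 2, 2)],
    "CNN-L": [(16, 3, 1, 1, None, None), (16, 3, 1, 1, 2, 2),
              (32, 3, 1, 1, 3, 2), (48, 3, 1, 1, 2, 2)],
}


def flattened_length(layers, input_size=58):
    """Number of activations fed to the first fully connected layer."""
    size = input_size
    filters = 0
    for filters, kernel, stride, pad, pool_window, pool_stride in layers:
        size = out_size(size, kernel, stride, pad)
        if pool_window is not None:
            size = out_size(size, pool_window, pool_stride)
    return filters * size * size


for name, layers in ARCHITECTURES.items():
    print(name, flattened_length(layers))
```

Under these assumptions, CNN-S and CNN-M both end with a 2 x 2 spatial map, so only the filter count distinguishes the input to their 160-dimensional feature layer.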





































3.2. Metric Learning 

Metric Learning (MeL), which aims to find a new metric that makes two classes more separable, is often used for face verification. MeL is independent of the feature extraction process, and any feature (hand-crafted or learning-based) can be fed into a MeL method. The Joint Bayesian (JB) model [6] is a well-known MeL method, and it is the one most widely applied to the features learned by CNNs [21, 19, 22].

JB models the face verification task as a Bayesian decision problem. Let $H_I$ and $H_E$ represent the intra-personal (matched) and extra-personal (unmatched) hypotheses, respectively. Based on the MAP (Maximum a Posteriori) rule, the decision is made by:

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)} \qquad (1)$$

where $x_1$ and $x_2$ are the features of one face pair. It is assumed that $P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$ have Gaussian distributions $N(0, \Sigma_I)$ and $N(0, \Sigma_E)$, respectively.

Before discussing the way of computing $\Sigma_I$ and $\Sigma_E$, we first explain the distribution of a face feature. A face $x$ is modelled by the sum of two independent Gaussian variables (identity $\mu$ and intra-personal variations $\varepsilon$):

$$x = \mu + \varepsilon \qquad (2)$$

$\mu$ and $\varepsilon$ follow two Gaussian distributions $N(0, S_\mu)$ and $N(0, S_\varepsilon)$, respectively. $S_\mu$ and $S_\varepsilon$ are two unknown covariance matrices and they are regarded as the face prior. For the case of two faces, the joint distribution of $\{x_1, x_2\}$ is also assumed to be a Gaussian with zero mean. Based on Eq. (2), the covariance of two faces is:

$$\mathrm{cov}(x_1, x_2) = \mathrm{cov}(\mu_1, \mu_2) + \mathrm{cov}(\varepsilon_1, \varepsilon_2) \qquad (3)$$

Then $\Sigma_I$ and $\Sigma_E$ can be derived as:

$$\Sigma_I = \begin{bmatrix} S_\mu + S_\varepsilon & S_\mu \\ S_\mu & S_\mu + S_\varepsilon \end{bmatrix} \qquad (4)$$

and

$$\Sigma_E = \begin{bmatrix} S_\mu + S_\varepsilon & 0 \\ 0 & S_\mu + S_\varepsilon \end{bmatrix} \qquad (5)$$

Clearly, $r(x_1, x_2)$ in Eq. (1) only depends on $S_\mu$ and $S_\varepsilon$, which are learned from data using an EM algorithm [6].
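Under the zero-mean Gaussian assumptions above, the log-likelihood ratio of Eq. (1) can be written out explicitly. In the expression below, $x = [x_1^\top, x_2^\top]^\top$ denotes the stacked feature pair (a symbol introduced here only for compactness), and $\Sigma_I$ and $\Sigma_E$ denote the intra-personal and extra-personal covariance matrices of Eqs. (4) and (5). This is the standard Gaussian log-density identity rather than the efficient closed form derived in [6]:

```latex
r(x_1, x_2)
  = \log \mathcal{N}(x;\, 0, \Sigma_I) - \log \mathcal{N}(x;\, 0, \Sigma_E)
  = \tfrac{1}{2}\, x^\top \left( \Sigma_E^{-1} - \Sigma_I^{-1} \right) x
    + \tfrac{1}{2} \left( \log\lvert \Sigma_E \rvert - \log\lvert \Sigma_I \rvert \right)
```

The second term does not depend on the face pair, so in practice a verification decision can be made by thresholding the quadratic form alone.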


4. Evaluation

LFW contains 5,749 subjects and 13,233 images, and the training and test sets are defined in [12]. For evaluation, LFW is divided into 10 predefined splits for 10-fold cross validation. Each time, nine of them are used for model training and the remaining one (600 image pairs) for testing. LFW defines three standard protocols (unsupervised, restricted and unrestricted) to evaluate face recognition performance. The 'unrestricted' protocol is applied here because both the subject identities and the matched/unmatched labels are used in our system. The face recognition rate is evaluated by the mean classification accuracy and the standard error of the mean.

Figure 1. Cropped sample images in LFW (matched and unmatched pairs)
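The reporting convention (mean accuracy plus or minus the standard error of the mean over the 10 folds) can be sketched as follows. The per-fold accuracies are invented purely for illustration and are not results from this paper; the function name is likewise hypothetical.

```python
import math
import statistics


def mean_and_standard_error(fold_accuracies):
    """Mean accuracy and standard error of the mean over cross-validation folds."""
    mean = statistics.mean(fold_accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(fold_accuracies) / math.sqrt(len(fold_accuracies))
    return mean, sem


# Hypothetical per-fold accuracies, invented purely for illustration.
folds = [0.78, 0.80, 0.79, 0.77, 0.81, 0.79, 0.78, 0.80, 0.79, 0.79]
mean, sem = mean_and_standard_error(folds)
print(f"{mean:.4f} ± {sem:.4f}")
```

Note that the standard error is smaller than the standard deviation of the fold accuracies by a factor of sqrt(10), which matters when comparing numbers reported under the two conventions.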

The images we use are aligned by deep funneling [11]. Each image is cropped to 58 x 58 based on the coordinates of the two eye centers. Some sample crops are visualised in Fig. 1. It is commonly believed that data augmentation can boost the generalisation capacity of a neural network; therefore, each image is horizontally flipped. The mean of the images is subtracted before network training. The open source implementation MatConvNet [27] is used to train our CNNs. In this section, different components of our CNN-based face recognition system are evaluated and analysed.

Architectures. Choosing a 'good' architecture is crucial for CNN training. Networks that are too large or too small relative to the training data can lead to overfitting or underfitting, respectively, in which case the network may not converge at all during training. In this comparison, RGB colour images are fed into the CNNs and the feature distance is measured by cosine distance. The performances of the three architectures are compared in Table 3. CNN-M achieves the best face recognition performance, indicating that CNN-M generalises best among these three architectures using only LFW data. From this point, all the other evaluations are conducted using CNN-M. The face recognition rate of 0.7882 achieved by CNN-M is considered as the baseline, and all the remaining investigations are compared with it.

Table 3. Comparisons of Our Three Architectures

| Model | Accuracy |
|---|---|
| CNN-S | 0.7828±0.0046 |
| CNN-M | 0.7882±0.0037 |
| CNN-L | 0.7807±0.0035 |


Feature Distance. The existing research offers little discussion about the distance measurement for CNN-learned features. In particular, it is interesting to know which distance measure is best for face recognition. Table 4 compares the impact of six distance measures on face recognition accuracy. Cosine and correlation achieve the best recognition rates; however, the standard error of cosine is smaller than that of correlation. Therefore, cosine distance is the best among these distances.

Table 4. Distance Comparison

| Distance | Accuracy |
|---|---|
| euclidean | 0.6898±0.0092 |
| city block | 0.6892±0.0088 |
| chebychev | 0.6692±0.0088 |
| cosine | 0.7882±0.0037 |
| correlation | 0.7882±0.0040 |
| spearman | 0.7878±0.0031 |

Grey vs Colour. In [28] and [25], CNNs are trained using grey-level and RGB colour images, respectively. In comparison, both grey and colour images are used in [21]. We quantitatively compare the impact of these two image types on face recognition. The face recognition accuracies using grey and colour images are 0.7830±0.0077 and 0.7882±0.0118, respectively. The performances using grey and colour images are very close to each other. Although colour images contain more information, they do not deliver a significant improvement.

Data Augmentation. Flipping, i.e. mirroring images horizontally to produce two samples from each, is a commonly used data augmentation technique for face recognition. Both original and mirrored images are used for training in all our evaluations. However, the existing work offers little analysis of the impact of image flipping during testing. Naturally, the test images can also be mirrored. A pair of test images produces 2 new mirrored ones, and these 4 images generate 4 pairs instead of the one original pair. To combine these 4 images/pairs, two fusion strategies (feature fusion and score fusion) are implemented in this work. For feature fusion, the learned features of a test image and its mirrored version are concatenated into one feature, which is then used for score computation. For score fusion, the 4 scores generated from the 4 pairs are averaged into one score. Table 5 compares the three scenarios: no flip during the test, feature fusion and score fusion. As shown in Table 5, mirroring images does improve the face recognition performance. In addition, feature fusion works slightly better than score fusion; however, the improvements are not statistically significant.

Table 5. Comparison of Data Augmentation during Test

| | Accuracy |
|---|---|
| no flip on test set | 0.7882 ± 0.0037 |
| feature fusion | 0.7895 ± 0.0036 |
| score fusion | 0.7893 ± 0.0035 |
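The two test-time fusion strategies can be sketched as follows. The sketch assumes that features are plain lists of floats and that pairs are scored by cosine similarity; the function names and the toy 3-dimensional features are hypothetical and stand in for the 160-dimensional CNN features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feature_fusion_score(f1, f1_flip, f2, f2_flip):
    """Concatenate each image's feature with that of its mirror, then score once."""
    return cosine_similarity(f1 + f1_flip, f2 + f2_flip)


def score_fusion_score(f1, f1_flip, f2, f2_flip):
    """Average the scores of the 4 pairs formed from the originals and mirrors."""
    pairs = [(f1, f2), (f1, f2_flip), (f1_flip, f2), (f1_flip, f2_flip)]
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy features for two images and their mirrored versions (illustration only).
img1, img1_flip = [0.9, 0.1, 0.3], [0.8, 0.2, 0.3]
img2, img2_flip = [0.7, 0.2, 0.4], [0.9, 0.1, 0.2]

print(feature_fusion_score(img1, img1_flip, img2, img2_flip))
print(score_fusion_score(img1, img1_flip, img2, img2_flip))
```

Feature fusion doubles the feature length but needs only one distance computation per pair, whereas score fusion keeps the feature length and computes four distances.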



Figure 2. The impact of feature dimensionality in PCA space on face recognition rate

Learned Feature Analysis. It is interesting to investigate the properties of CNN-learned face representations. First, we discuss feature normalisation, which standardises the range of features and is generally performed during the data preprocessing step. For example, to implement eigenface [26], the features (pixel values) are usually normalised via Eq. (6) before training a PCA space:

$$\hat{x} = \frac{x - \mu_x}{\sigma_x} \qquad (6)$$

where $x$ and $\hat{x}$ are the original and normalised feature vectors, respectively, and $\mu_x$ and $\sigma_x$ are the mean and standard deviation of $x$. Motivated by this, our CNN features are normalised by Eq. (6) before computing the cosine distance. The accuracies with and without normalisation are 0.7927±0.0126 and 0.7882±0.0118, respectively. Thus normalisation yields a modest improvement in recognition rate.
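A minimal sketch of this normalise-then-compare step is given below. It takes the text literally and computes the mean and standard deviation over the entries of each feature vector, using the population standard deviation; both choices are assumptions, and statistics computed per dimension over a training set would be an equally plausible reading. The function names and toy vectors are hypothetical.

```python
import math
import statistics


def normalise(feature):
    """Subtract the mean and divide by the standard deviation of the vector."""
    mu = statistics.mean(feature)
    sigma = statistics.pstdev(feature)  # assumption: population standard deviation
    return [(value - mu) / sigma for value in feature]


def cosine_distance(a, b):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 4-dimensional features standing in for 160-dimensional CNN features.
face_a = [0.2, 1.5, 0.0, 0.7]
face_b = [0.1, 1.2, 0.3, 0.9]

distance = cosine_distance(normalise(face_a), normalise(face_b))
print(distance)
```

Under this per-vector reading, the normalised vectors are mean-centred, so the resulting cosine distance coincides with the correlation distance of the original vectors.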

Second, we perform dimensionality reduction on the learned 160D features using PCA. As shown in Figure 2, only 16 dimensions of the PCA feature space achieve face recognition rates comparable to those of the original space. This is a very interesting property of CNN-learned features because low dimensionality can significantly reduce storage space and computation, which is crucial for large-scale applications or mobile devices such as smartphones.

Network Fusion. DeepID [21] and its variants [19, 22] apply the fusion of multiple networks. Specifically, the images of different facial regions and scales are separately fed into networks that have the same architecture. The features learned by the different networks are concatenated into a powerful face representation, which implicitly captures the spatial information of facial parts. The size of these images can be different, as shown in Table 1. In [21], 120 networks are trained separately for this fusion. However, it is not very clear how greatly this fusion improves the face recognition performance. To clarify this issue, we implement the network fusion.

Figure 3. Sample crops in LFW. Rows correspond to 5 regions from 4 corners and center; columns correspond to 6 scales.

We extract d x d crops from the four corners and the center and then upsample them to the original image size of 58 x 58. The crops have 6 different scales: d = floor(58 x {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}), where floor is the operator that takes the integer part. Therefore, we obtain 30 local patches of size 58 x 58 from one original image. Figure 3 shows these 30 crops. To evaluate the performance of network fusion, we separately train 30 different networks using these crops. One face image can then be represented by concatenating the features learned by the different networks. Table 6 compares the performance of a single network and network fusion. Note that we choose the 16 best of the 30 networks for the fusion. It is clear that network fusion works much better than a single network. Specifically, the fusion of the 16 best networks improves the face recognition accuracy of the single network by 4.51 percentage points (from 0.7882 to 0.8333). The face representation of network fusion is in effect the fusion of features of different facial components and scales. Similar ideas have been widely used to improve the facial representation capacity of hand-crafted features, such as multi-scale local binary patterns [16], multi-scale local phase quantisation [3] and high-dimensional local features [7].
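The crop geometry described above can be sketched as follows. The sketch only computes the crop boxes; the upsampling to 58 x 58 is omitted. The function name, the (left, top, d) box format and the rounding of the centre offset are assumptions made for illustration.

```python
import math

IMAGE_SIZE = 58
SCALES = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)


def crop_boxes(image_size=IMAGE_SIZE, scales=SCALES):
    """Return (left, top, d) for the 5 regions x 6 scales of square crops."""
    boxes = []
    for scale in scales:
        d = math.floor(image_size * scale)
        far = image_size - d         # offset of crops aligned to the right/bottom
        mid = (image_size - d) // 2  # assumption: centre offset rounded down
        regions = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
        for left, top in regions:
            boxes.append((left, top, d))
    return boxes


boxes = crop_boxes()
print(len(boxes))
print(sorted({d for _, _, d in boxes}))
```

Each box would then be cut from the aligned face image and resized to 58 x 58 before being fed to its own network.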


Table 6. Comparison of Network Fusion

| | Accuracy |
|---|---|
| single network | 0.7882 ± 0.0037 |
| network fusion | 0.8333 ± 0.0042 |


Figure 4. Face recognition accuracies with and without JB (horizontal axis: the split index in View 2 of LFW)

Metric Learning. For metric learning, the features of the fusion of the 16 best networks are used. The feature dimensionality (2560 = 160 x 16) is reduced to 320 via PCA before the features are fed into JB. Figure 4 compares the face recognition accuracies with and without JB on each split of the LFW database. JB consistently and significantly improves the face recognition rates, showing the importance of metric learning.

Table 7 compares our method with non-commercial state-of-the-art methods. The performance of our method is slightly better than [24, 8, 15] but worse than [7, 17, 20]. However, the feature dimensionality of [7, 17] is much higher than ours. In [20], a large number of new pairs are generated in addition to those provided by LFW to train the model, while we do not generate new pairs.

Table 7. Comparison with state-of-the-art methods on LFW under 'unrestricted, label-free outside data'

| Methods | Accuracy |
|---|---|
| LBP multishot [24] | 0.8517 ± 0.0061 |
| LDML-MkNN [8] | 0.8750 ± 0.0040 |
| LBP+PLDA [15] | 0.8733 ± 0.0055 |
| high-dim LBP [7] | 0.9318 ± 0.0107 |
| Fisher vector faces [17] | 0.9303 ± 0.0105 |
| ConvNet-RBM [20] | 0.9175 ± 0.0048 |
| Network fusion + JB | 0.8763 ± 0.0064 |


5. Conclusions 

Recently, convolutional neural networks have attracted a lot of attention in the field of face recognition. In this work, we present a rigorous empirical evaluation of CNN-based face recognition systems. Specifically, we quantitatively evaluate the impact of different architectures and implementation choices of CNNs on face recognition performance on a common ground. We have shown that network fusion can significantly improve the face recognition performance because different networks capture the information from different regions and scales to form a powerful face representation. In addition, metric learning such as the Joint Bayesian method can greatly improve face recognition performance.

Since network fusion and metric learning are the two 
most important factors affecting CNN performance, they 
will be the subject of future investigation. 

Acknowledgements. This work is supported by the European Union's Horizon 2020 research and innovation program under grant agreement No 640891, EPSRC/dstl project 'Signal processing in a networked battlespace' under contract EP/K014307/1, EPSRC Programme Grant 'S3A: Future Spatial Audio for Immersive Listener Experiences at Home' under contract EP/L000539, and the European Union project BEAT. We also gratefully acknowledge the support of NVIDIA Corporation for the donation of the GPUs used for this research.

References 

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face recognition with local binary patterns. In Computer Vision - ECCV 2004, pages 469-481. Springer, 2004.

[2] T. Ahonen, E. Rahtu, V. Ojansivu, and J. Heikkila. Recognition of blurred faces using local phase quantization. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1-4. IEEE, 2008.

[3] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikainen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164-1177, 2013.

[4] C. H. Chan, M. A. Tahir, J. Kittler, and M. Pietikainen. Multiscale local phase quantization for robust component-based face recognition using kernel fusion of multiple descriptors. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1164-1177, 2013.

[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Computer Vision - ECCV 2012, pages 566-579. Springer, 2012.

[7] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3025-3032. IEEE, 2013.

[8] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric learning approaches for face identification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 498-505. IEEE, 2009.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.

[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[11] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems, pages 764-772, 2012.

[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[15] P. Li, Y. Fu, U. Mohammed, J. H. Elder, and S. J. Prince. Probabilistic models for inference about identity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(1):144-157, 2012.

[16] S. Liao, X. Zhu, Z. Lei, L. Zhang, and S. Z. Li. Learning multi-scale block local binary patterns for face recognition. In Advances in Biometrics, pages 828-837. Springer, 2007.

[17] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher Vector Faces in the Wild. In British Machine Vision Conference, 2013.

[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[19] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988-1996, 2014.

[20] Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1489-1496. IEEE, 2013.

[21] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1891-1898. IEEE, 2014.

[22] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arXiv preprint arXiv:1412.1265, 2014.

[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[24] Y. Taigman, L. Wolf, T. Hassner, et al. Multiple one-shots for utilizing class label information. In BMVC, pages 1-12, 2009.

[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701-1708. IEEE, 2014.

[26] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR'91, IEEE Computer Society Conference on, pages 586-591. IEEE, 1991.

[27] A. Vedaldi and K. Lenc. MatConvNet - convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.

[28] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.

[29] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint arXiv:1501.04690, 2015.



